NYC Green Taxi Analysis

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Question 1

  • Programmatically download and load into your favorite analytical tool the trip data for September 2015.
  • Report how many rows and columns of data you have loaded.
In [2]:
import pandas as pd
pd.options.mode.chained_assignment = None
from __future__ import division
green = pd.read_csv('https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-09.csv')
In [3]:
green.head()
Out[3]:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count ... Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 ... 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 ... 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 ... 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 ... 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 ... 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0

5 rows × 21 columns

In [4]:
green.shape
Out[4]:
(1494926, 21)

There are 1,494,926 rows and 21 columns in this NYC Green Taxi trip data for September 2015

Exploratory Analysis and Data Cleaning

Looking through the data, some of the fields seem erroneous. For example, Passenger_count, Trip_distance, Fare_amount, Extra, MTA_tax, Tip_amount, Tolls_amount, improvement_surcharge and Total_amount all have values less than or equal to 0, and there was one trip whose total distance was over 600 miles, which seems suspicious to me! Since we have about 1.5 million rows loaded, I will simply remove these bad records.
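Before dropping rows, it can help to count how many rows each rule flags. A minimal sketch on a toy frame (the column names mirror the dataset; the values are invented):

```python
import pandas as pd

# Toy frame mimicking a few of the dataset's columns (the values are invented)
toy = pd.DataFrame({
    'Passenger_count': [1, 0, 2],
    'Trip_distance':   [1.5, 2.0, 0.0],
    'Fare_amount':     [7.8, -4.0, 5.0],
})

# Count how many rows each cleaning rule would flag before dropping anything
flags = {col: (toy[col] <= 0).sum() for col in toy.columns}
print(flags)
```

On the real frame this makes it easy to see which filter removes the bulk of the ~50,000 discarded rows.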

In [5]:
list(green.columns)
Out[5]:
['VendorID',
 'lpep_pickup_datetime',
 'Lpep_dropoff_datetime',
 'Store_and_fwd_flag',
 'RateCodeID',
 'Pickup_longitude',
 'Pickup_latitude',
 'Dropoff_longitude',
 'Dropoff_latitude',
 'Passenger_count',
 'Trip_distance',
 'Fare_amount',
 'Extra',
 'MTA_tax',
 'Tip_amount',
 'Tolls_amount',
 'Ehail_fee',
 'improvement_surcharge',
 'Total_amount',
 'Payment_type',
 'Trip_type ']
In [6]:
green.describe()
/Users/zeyu/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[6]:
VendorID RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
count 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 1.494926e+06 0.0 1.494926e+06 1.494926e+06 1.494926e+06 1.494922e+06
mean 1.782045e+00 1.097653e+00 -7.383084e+01 4.069114e+01 -7.383728e+01 4.069291e+01 1.370598e+00 2.968141e+00 1.254320e+01 3.512800e-01 4.866408e-01 1.235727e+00 1.231047e-01 NaN 2.920991e-01 1.503215e+01 1.540559e+00 1.022353e+00
std 4.128570e-01 6.359437e-01 2.776082e+00 1.530882e+00 2.677911e+00 1.476698e+00 1.039426e+00 3.076621e+00 1.008278e+01 3.663096e-01 8.504473e-02 2.431476e+00 8.910137e-01 NaN 5.074009e-02 1.155316e+01 5.232935e-01 1.478288e-01
min 1.000000e+00 1.000000e+00 -8.331908e+01 0.000000e+00 -8.342784e+01 0.000000e+00 0.000000e+00 0.000000e+00 -4.750000e+02 -1.000000e+00 -5.000000e-01 -5.000000e+01 -1.529000e+01 NaN -3.000000e-01 -4.750000e+02 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 -7.395961e+01 4.069895e+01 -7.396782e+01 4.069878e+01 1.000000e+00 1.100000e+00 6.500000e+00 0.000000e+00 5.000000e-01 0.000000e+00 0.000000e+00 NaN 3.000000e-01 8.160000e+00 1.000000e+00 NaN
50% 2.000000e+00 1.000000e+00 -7.394536e+01 4.074674e+01 -7.394504e+01 4.074728e+01 1.000000e+00 1.980000e+00 9.500000e+00 5.000000e-01 5.000000e-01 0.000000e+00 0.000000e+00 NaN 3.000000e-01 1.176000e+01 2.000000e+00 NaN
75% 2.000000e+00 1.000000e+00 -7.391748e+01 4.080255e+01 -7.391013e+01 4.079015e+01 1.000000e+00 3.740000e+00 1.550000e+01 5.000000e-01 5.000000e-01 2.000000e+00 0.000000e+00 NaN 3.000000e-01 1.830000e+01 2.000000e+00 NaN
max 2.000000e+00 9.900000e+01 0.000000e+00 4.317726e+01 0.000000e+00 4.279934e+01 9.000000e+00 6.031000e+02 5.805000e+02 1.200000e+01 5.000000e-01 3.000000e+02 9.575000e+01 NaN 3.000000e-01 5.813000e+02 5.000000e+00 2.000000e+00
In [7]:
# The mysterious 600-mile trip...
green[green.Trip_distance == max(green.Trip_distance)].loc[:,'Trip_distance':]
Out[7]:
Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
1321961 603.1 1.25 0.5 0.5 0.0 0.0 NaN 0.3 2.55 2 1.0

The above row looks like bad data, so I will simply remove it.

In [8]:
clean_green = green[(green.Passenger_count>0)
                    &(green.Trip_distance>0) & (green.Trip_distance!=max(green.Trip_distance))
                    &(green.Fare_amount>0)
                    &(green.Extra >=0)
                    &(green.MTA_tax>0)
                    &(green.Tip_amount>=0)
                    &(green.Tolls_amount>=0)
                    &(green.improvement_surcharge>=0)
                    &(green.Total_amount>0)]
   
In [9]:
clean_green.shape
Out[9]:
(1445000, 21)
In [10]:
clean_green.describe()
Out[10]:
VendorID RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
count 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06 1445000.0 1.445000e+06 1.445000e+06 0.0 1.445000e+06 1.445000e+06 1.445000e+06 1.445000e+06
mean 1.783323e+00 1.005469e+00 -7.388521e+01 4.071978e+01 -7.389248e+01 4.072202e+01 1.372075e+00 2.982729e+00 1.245609e+01 3.611176e-01 0.5 1.238804e+00 1.194533e-01 NaN 2.998271e-01 1.497545e+01 1.532069e+00 1.000026e+00
std 4.119807e-01 1.140230e-01 1.936035e+00 1.068221e+00 1.773247e+00 9.785044e-01 1.044838e+00 2.955850e+00 9.106215e+00 3.651980e-01 0.0 2.245353e+00 8.499960e-01 NaN 7.200867e-03 1.057566e+01 5.168391e-01 5.060128e-03
min 1.000000e+00 1.000000e+00 -7.537244e+01 0.000000e+00 -7.525642e+01 0.000000e+00 1.000000e+00 1.000000e-02 1.000000e+00 0.000000e+00 0.5 0.000000e+00 0.000000e+00 NaN 0.000000e+00 1.800000e+00 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 -7.396007e+01 4.069875e+01 -7.396846e+01 4.069866e+01 1.000000e+00 1.110000e+00 6.500000e+00 0.000000e+00 0.5 0.000000e+00 0.000000e+00 NaN 3.000000e-01 8.300000e+00 1.000000e+00 1.000000e+00
50% 2.000000e+00 1.000000e+00 -7.394617e+01 4.074651e+01 -7.394576e+01 4.074677e+01 1.000000e+00 2.000000e+00 9.500000e+00 5.000000e-01 0.5 0.000000e+00 0.000000e+00 NaN 3.000000e-01 1.176000e+01 2.000000e+00 1.000000e+00
75% 2.000000e+00 1.000000e+00 -7.391817e+01 4.080164e+01 -7.391149e+01 4.078777e+01 1.000000e+00 3.750000e+00 1.550000e+01 5.000000e-01 0.5 2.000000e+00 0.000000e+00 NaN 3.000000e-01 1.830000e+01 2.000000e+00 1.000000e+00
max 2.000000e+00 6.000000e+00 0.000000e+00 4.317726e+01 0.000000e+00 4.197125e+01 8.000000e+00 1.347000e+02 5.805000e+02 1.000000e+00 0.5 3.000000e+02 9.575000e+01 NaN 3.000000e-01 5.813000e+02 5.000000e+00 2.000000e+00

Question 2

  • Plot a histogram of the number of the trip distance ("Trip Distance").
  • Report any structure you find and any hypotheses you have about that structure.
In [11]:
import seaborn as sns
plt.figure(figsize=(20,10))
sns.plt.title('Trip Distance Distribution')
sns.distplot(clean_green.Trip_distance, kde=False)
/Users/zeyu/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
/Users/zeyu/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:564: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n = np.zeros(bins, ntype)
/Users/zeyu/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:611: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
Out[11]:

Looking at the histogram of trip distance, most trips were within 20 miles, with only a few long-distance trips. By calculation, 96% of the trips were less than 10 miles and 99% were less than 20 miles. The median from the reports above is 2 miles.

In [12]:
print clean_green[clean_green.Trip_distance<10].shape[0]/clean_green.shape[0]
print clean_green[clean_green.Trip_distance<20].shape[0]/clean_green.shape[0]
0.964588927336
0.998125259516

My hypothesis about this distribution is that it is not cost-effective to take a taxi for a long-distance trip. The average total amount is $14.8 for trips under 20 miles but $86 for trips above 20 miles. People might choose other modes of transportation for long-distance travel.

In [13]:
print mean(clean_green[clean_green.Trip_distance<20].Total_amount)
print mean(clean_green[clean_green.Trip_distance>=20].Total_amount)
14.841662688
86.2023735696

Question 3

  • Report mean and median trip distance grouped by hour of day.
  • We'd like to get a rough sense of identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit these criteria, the average fare, and any other interesting characteristics of these trips.
In [14]:
clean_green['pickup_hour'] = pd.to_datetime(clean_green.lpep_pickup_datetime).apply(lambda x: x.hour)
In [15]:
# Mean and Median Trip Distance by hour of day
clean_green.groupby('pickup_hour').mean()[['Trip_distance']].transpose()
Out[15]:
pickup_hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Trip_distance 3.139989 3.03841 3.07417 3.210847 3.53529 4.147171 4.081131 3.298662 3.05452 3.015967 ... 2.876524 2.865494 2.793347 2.690373 2.668716 2.728264 2.795564 3.013901 3.205027 3.215726

1 rows × 24 columns

In [16]:
clean_green.groupby('pickup_hour').median()[['Trip_distance']].transpose()
Out[16]:
pickup_hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Trip_distance 2.23 2.16 2.19 2.26 2.4 2.94 2.9 2.2 2.0 2.0 ... 1.86 1.84 1.82 1.8 1.8 1.87 1.92 2.06 2.21 2.25

1 rows × 24 columns
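The two groupby calls above can also be combined into one table using `agg`; a sketch on made-up data:

```python
import pandas as pd

# Toy trips standing in for the full dataset (hours and distances are invented)
trips = pd.DataFrame({
    'pickup_hour':   [0, 0, 5, 5, 5],
    'Trip_distance': [2.0, 4.0, 3.0, 5.0, 10.0],
})

# One agg call yields mean and median side by side for each hour
by_hour = trips.groupby('pickup_hour')['Trip_distance'].agg(['mean', 'median'])
print(by_hour)
```

Putting mean and median side by side makes the right-skew visible at a glance: the mean sits above the median in every hour.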

To estimate whether a pickup or drop-off location is at an airport (JFK, LGA or EWR), I used the geopy package (http://geopy.readthedocs.io/en/1.10.0/) to calculate the distance between two coordinates. I took a naive approach: if the distance between the target coordinates and an airport is within 0.5 miles, I consider the location to be that airport.

In [17]:
from geopy.distance import vincenty
def nyc_airports(location, threshold=0.5):
    """Helper function to determine whether a coordinate is within proximity of one of the airports."""
    JFK = (40.639722,-73.778889)
    LGA = (40.77725, -73.872611)
    EWR = (40.6925, -74.168611)
    
    if vincenty(JFK, location).miles <=threshold:
        return 'JFK'
    elif vincenty(LGA, location).miles <=threshold:
        return 'LGA'
    elif vincenty(EWR, location).miles <=threshold:
        return 'EWR'
    else:
        return 'Other'
In [18]:
clean_green['pickup_coordinates'] = zip(clean_green.Pickup_latitude,clean_green.Pickup_longitude)
clean_green['dropoff_coordinates'] = zip(clean_green.Dropoff_latitude,clean_green.Dropoff_longitude)
In [19]:
clean_green['pickup_location'] = clean_green.pickup_coordinates.apply(nyc_airports)
clean_green['dropoff_location'] = clean_green.dropoff_coordinates.apply(nyc_airports)
In [20]:
clean_green.loc[:,'pickup_coordinates':].head()
Out[20]:
pickup_coordinates dropoff_coordinates pickup_location dropoff_location
2 (40.766708374, -73.9214096069) (40.7646865845, -73.9144134521) Other Other
3 (40.7666778564, -73.9213867188) (40.7715835571, -73.931427002) Other Other
4 (40.7140464783, -73.9554824829) (40.7147293091, -73.9444122314) Other Other
5 (40.8081855774, -73.9452972412) (40.8211975098, -73.9376678467) Other Other
6 (40.7464256287, -73.89087677) (40.7563056946, -73.8769226074) Other Other

Now let's look at how well we did in finding trips that originate or terminate at one of the NYC area airports!

In [21]:
clean_green.groupby('pickup_location').size()
Out[21]:
pickup_location
EWR            2
JFK           40
LGA           84
Other    1444874
dtype: int64
In [22]:
clean_green.groupby('dropoff_location').size()
Out[22]:
dropoff_location
EWR          100
JFK         6631
LGA        11472
Other    1426797
dtype: int64

I am a bit surprised that so few trips originated from the airports while many terminated at them! However, after reading about the NYC green taxi (http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.shtml), I found that "Boro Taxi drivers can pick up passengers from the street in northern Manhattan (north of West 110th street and East 96th street), the Bronx, Queens (excluding the airports), Brooklyn and Staten Island and they may drop you off anywhere." So it turns out that green taxis are not supposed to pick up at the airports!

In [23]:
# Figuring out airport trips and non-airport trips 
masks = (clean_green.pickup_location != 'Other') | (clean_green.dropoff_location != 'Other')
airport_trips = clean_green[masks]
non_airport_trips = clean_green[~masks]
In [24]:
print airport_trips.shape[0]
print mean(airport_trips.Total_amount)
18300
34.1693060109

About 18,300 transactions fit this criterion, and the average fare of these trips is $34.2. I'm also interested in when these airport trips happen during the day compared with other trips, and whether there is any difference in the tip percentage.

In [25]:
plt.figure(figsize=(8,6))
sns.plt.title('Airport Trips Pickup Hour Distribution')
sns.distplot(airport_trips.pickup_hour,kde=False)
Out[25]:
In [26]:
plt.figure(figsize=(8,6))
sns.plt.title('Non-Airport Trips Pickup Hour Distribution')
sns.distplot(non_airport_trips.pickup_hour,kde=False)
Out[26]:

It's interesting to notice the difference in the hourly distributions. For airport trips, most trips happened in the morning and afternoon; very few trips were observed after 8:00 pm, which probably has to do with fewer flights in the late evening. For non-airport trips, we can see a spike around 6:00 pm that declines throughout the evening. It could be people going home after work or after a night out.

In [27]:
mean(non_airport_trips.Tip_amount/(non_airport_trips.Total_amount-non_airport_trips.Tip_amount))
Out[27]:
0.083623266354215811
In [28]:
mean(airport_trips.Tip_amount/(airport_trips.Total_amount-airport_trips.Tip_amount))
Out[28]:
0.12727342028424771

It's also interesting to see the difference in tip percentage between airport and non-airport trips: 12.7% for airport trips versus 8.4% for non-airport trips. It could be that passengers who get help with their luggage are more likely to tip their drivers more.
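Whether that gap is statistically meaningful could be checked with a two-sample test. A sketch on invented samples centred on the two observed means (the real inputs would be the two tip-percentage series):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Invented samples centred on the observed means (12.7% vs 8.4%)
airport_tips = rng.normal(0.127, 0.05, size=200)
other_tips   = rng.normal(0.084, 0.05, size=200)

# Welch's t-test: does not assume equal variances between the two groups
t_stat, p_val = stats.ttest_ind(airport_tips, other_tips, equal_var=False)
print(t_stat, p_val)
```

With 1.4 million real trips even a small gap would test as significant, so effect size (the 4-point difference itself) is the more interesting number.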

Question 4

  • Build a derived variable for tip as a percentage of the total fare.
  • Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). We will validate a sample.

To better capture tip as a percentage of the total fare, I will only select credit card transactions, since according to the data dictionary: "Tip amount – This field is automatically populated for credit card tips. Cash tips are not included."

In [29]:
tip_data = clean_green[clean_green.Payment_type == 1 ]
In [30]:
#Build the target variable tip as a percentage of the total fare
tip_data['tip_perc'] = tip_data.Tip_amount/(tip_data.Total_amount-tip_data.Tip_amount)
In [31]:
plt.figure(figsize=(20,10))
sns.plt.title('Tip Percentage Distribution')
plt.xlim(0, 1)
sns.distplot(tip_data.tip_perc,kde=False)
Out[31]:

It's interesting to see spikes around 20%, 25% and 30%, since these are the default tip percentages offered during checkout.
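The share of trips sitting exactly on those default buttons can be quantified; a sketch on invented values (the real input would be tip_data.tip_perc):

```python
import numpy as np

# Invented tip percentages; the real input would be tip_data.tip_perc
tip_perc = np.array([0.20, 0.25, 0.30, 0.18, 0.20, 0.07])

# Share of trips landing (within rounding) on each default button
for default in (0.20, 0.25, 0.30):
    share = np.isclose(tip_perc, default, atol=0.005).mean()
    print(default, share)
```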

Feature Engineering and Selection

In [32]:
# Creating a variable trip_time in minutes
tip_data['trip_time'] = (pd.to_datetime(tip_data.Lpep_dropoff_datetime)-
                         pd.to_datetime(tip_data.lpep_pickup_datetime)).astype('timedelta64[s]')/60
In [33]:
# Creating a variable total_fare for the amount paid excluding tips
tip_data['total_fare'] = tip_data.Total_amount-tip_data.Tip_amount
In [34]:
# Creating dummy variables for hour of the day
dummies = pd.get_dummies(tip_data['pickup_hour'],prefix='hour')
tip_data = pd.concat([tip_data, dummies], axis=1)  
In [35]:
# Creating a variable for airport trips
tip_data['airport'] = ((tip_data.pickup_location != 'Other') | (tip_data.dropoff_location != 'Other')).astype(int)
In [36]:
#feature columns
X_columns = ['Passenger_count','Trip_distance','trip_time','total_fare','airport',
     'hour_0','hour_1','hour_2','hour_3','hour_4','hour_5','hour_6','hour_7','hour_8',
     'hour_9','hour_10','hour_11','hour_12','hour_13','hour_14','hour_15','hour_16',
     'hour_17','hour_18','hour_19','hour_20','hour_21','hour_22','hour_23']
y_column = ['tip_perc']

Running a linear regression to predict the tip percentage

In [52]:
from sklearn import feature_selection as f_select
from sklearn import linear_model as lm
from sklearn import metrics
from sklearn import cross_validation as cv

# Selecting significant columns by testing each single regressor
significant_columns = []
pvals = []
for feature in X_columns:
    pval = f_select.f_regression(tip_data[[feature]],tip_data[y_column].values.ravel())
    if pval[1][0] < 0.05:
        significant_columns.append(feature)
        pvals.append(pval[1][0])

# Train Test Split the data 
x_train, x_test, y_train, y_test = cv.train_test_split(tip_data[significant_columns],
                                                       tip_data[y_column],
                                                           test_size=0.333,
                                                           random_state=1234)
model = lm.LinearRegression().fit(x_train, y_train)

Model Evaluation and Explanation

In [39]:
print pd.DataFrame({'column': significant_columns, 
                    'coef': model.coef_.tolist()[0],
                    'p-values':pvals}).set_index('column')
print 
print 'Metrics on Training Set'
print 'R^2 :', metrics.r2_score(y_train, model.predict(x_train))
print 'MSE :', metrics.mean_squared_error(y_train, model.predict(x_train))
print 
print 'Metrics on Testing Set'
print 'R^2 :', metrics.r2_score(y_test, model.predict(x_test))
print 'MSE :', metrics.mean_squared_error(y_test, model.predict(x_test))
                   coef       p-values
column                                
Trip_distance  0.012037  4.525628e-139
trip_time     -0.000042   1.757728e-72
total_fare    -0.004741  9.178729e-286
airport        0.019419   1.169952e-09
hour_2         0.002787   9.834192e-03
hour_7        -0.008986   1.584164e-09
hour_8        -0.004438   1.801781e-08
hour_13        0.004350   2.962384e-02
hour_16       -0.001402   1.779203e-03
hour_17        0.000407   7.124455e-03
hour_20        0.001030   1.507279e-02
hour_21        0.003077   1.526063e-04
hour_22        0.005383   3.061552e-05

Metrics on Training Set
R^2 : 0.00469943758079
MSE : 0.0573998677298

Metrics on Testing Set
R^2 : 0.00214640967101
MSE : 0.0513595260948

From the coefficients we can find interesting things about these features:

  • Trip distance is positively correlated with the tip percentage, while both trip time and total fare have a negative impact.
  • Airport trips seem to be correlated with a higher tip percentage.
  • Late-night trips seem to indicate a higher tip percentage.

Overall the linear regression model does not perform very well, with a mere 0.005 R^2 on training and 0.002 on testing, which means that with these features we can only explain a tiny portion of the variance in the data. Typically, when attempting to predict human behavior, it is relatively hard to build a good model.
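One way to confirm the low R^2 is not an artifact of a single train/test split is k-fold cross-validation. A sketch on synthetic data with deliberately weak signal (note: newer scikit-learn versions expose cross_val_score from sklearn.model_selection rather than the cross_validation module used above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Synthetic stand-in: features only weakly related to the target,
# mimicking how little signal the trip features carry about tip_perc
X = rng.rand(500, 3)
y = 0.05 * X[:, 0] + rng.rand(500)

# 5-fold cross-validated R^2: low scores on every fold, not just one split
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean())
```

If all folds agree on a near-zero R^2, the problem is the features, not the split.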

Build a Classifier for high and low percentage tips

Instead of running a regression on the actual tip percentage, I would like to build a classifier to see if we can predict whether there will be a large tip percentage, using 20% as the threshold.

In [40]:
tip_data['high_tip'] = (tip_data.tip_perc > 0.2).astype(int)

About 25% of the trips have a tip higher than 20%

In [41]:
sum(tip_data.high_tip)/len(tip_data.high_tip)
Out[41]:
0.25161247054981456
In [42]:
# feature columns
X_columns = ['Passenger_count','Trip_distance','trip_time','total_fare','airport',
     'hour_0','hour_1','hour_2','hour_3','hour_4','hour_5','hour_6','hour_7','hour_8',
     'hour_9','hour_10','hour_11','hour_12','hour_13','hour_14','hour_15','hour_16',
     'hour_17','hour_18','hour_19','hour_20','hour_21','hour_22','hour_23']
y_column = ['high_tip']
In [43]:
# Train Test Split the data 
x_train, x_test, y_train, y_test = cv.train_test_split(tip_data[X_columns],
                                                       tip_data[y_column],
                                                           test_size=0.333,
                                                           random_state=1234)
In [46]:
# Build a random forest clf
from sklearn.ensemble import RandomForestClassifier
treeclf = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=1, random_state=0)
treeclf = treeclf.fit(x_train, y_train.values.ravel())
In [47]:
# Evalute the models
print 'fpr', metrics.roc_curve(y_test, treeclf.predict(x_test))[0][1] #fpr
print 'tpr', metrics.roc_curve(y_test, treeclf.predict(x_test))[1][1] #tpr
print 'precision', metrics.precision_score(y_test, treeclf.predict(x_test))
print 'accuracy', metrics.accuracy_score(y_test, treeclf.predict(x_test))

# ROC 
roc = metrics.roc_curve(y_test, treeclf.predict(x_test))
plt.figure()

plt.plot([0, 0.5, 1], [0, 0.5, 1])
plt.plot(roc[0], roc[1])
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
fpr 0.11047146402
tpr 0.20624168592
precision 0.383761073476
accuracy 0.718616329622
Out[47]:

The Random Forest classifier is still not doing a very good job of identifying the positives, with only 20% recall and 38% precision. As next steps, I would spend more time on feature engineering to extract more features from this dataset, and continue to try different optimizations and other classifiers such as SVM.
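Since only ~25% of trips are positives, a cheap step before heavier models is reweighting the classes. A sketch on synthetic imbalanced data (class_weight='balanced' is a standard scikit-learn option; the data here is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
# Imbalanced toy problem with roughly 25% positives, like the high-tip rate
X = rng.randn(400, 4)
y = (X[:, 0] + 0.5 * rng.randn(400) > 0.7).astype(int)

# class_weight='balanced' reweights the minority class,
# typically trading some precision for higher recall
clf = LogisticRegression(class_weight='balanced').fit(X, y)
print(recall_score(y, clf.predict(X)))
```

The same keyword works on RandomForestClassifier, so it could be dropped into the model above directly.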

Question 5

In this section, I was curious whether the airport trips I identified in the previous questions were accurate, so I plotted the pickup and dropoff locations on a map of the NYC area. The results look pretty good to me.

In [48]:
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.
    
    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 500px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))
In [49]:
from IPython.display import HTML
import folium
NYC = (40.7142700, -74.0059700)
map = folium.Map(location=NYC,zoom_start=12)
for each in airport_trips.iterrows():
    map.simple_marker(
        location = [each[1]['Pickup_latitude'],each[1]['Pickup_longitude']], 
        clustered_marker = True)
inline_map(map) 
Out[49]:
In [50]:
map = folium.Map(location=NYC)
for each in airport_trips.iterrows():
    map.simple_marker(
        location = [each[1]['Dropoff_latitude'],each[1]['Dropoff_longitude']], 
        clustered_marker = True)
In [51]:
inline_map(map)
Out[51]:
In [ ]:
 
